Scalability of Parallel Batch Pattern Neural 
Network Training Algorithm 


This paper presents a parallel batch pattern back propagation training algorithm for a multilayer
perceptron and a study of its scalability on a general-purpose parallel computer. The multilayer
perceptron model and the batch pattern training algorithm are described theoretically, followed by
an algorithmic description of the parallel batch pattern training method. The scalability of the
developed parallel algorithm is studied by progressively increasing the dimension of the
parallelized problem on the general-purpose parallel computer NEC TX-7.


Introduction 


Artificial neural networks (NNs) have excellent abilities to model difficult nonlinear
systems. They are a very good alternative to traditional methods for solving complex
problems in many fields, including image processing, prediction, pattern recognition, robotics,
optimization, etc. [1]. However, most NN models carry a high computational load, especially
in the training phase, which can take days or weeks. This is the main obstacle to the
efficient use of NNs in real-world applications. Given the inherently parallel nature
of NNs, many researchers have already focused on their parallelization [2-4].
Most existing parallelization approaches, however, are based on specialized computing
hardware and transputers, which can perform specific neural operations more
quickly than general-purpose parallel and high performance computers. Meanwhile, computational
clusters and Grids have gained tremendous popularity in computational science during the last
decade [5]. Computational Grids are heterogeneous systems, which may include
high performance computers with parallel architecture and computational clusters based on
standard PCs. Therefore, existing solutions for NN parallelization on transputer architectures
should be redesigned, and their parallelization efficiency should be verified on general-purpose
parallel and high performance computers, so that they can be used efficiently within
computational Grid systems.

Many researchers have already developed parallel algorithms for NN training at the weight
(connection), neuron (node), training set (pattern) and modular levels [6-10]. Connection
parallelism (parallel execution on sets of weights) and node parallelism (parallel execution
of operations on sets of neurons) are not efficient on a general-purpose high performance
computer because of the high synchronization and communication overhead
among the parallel processors [10]. Therefore, the coarse-grain approaches of pattern and modular
parallelism should be used to parallelize NN training on general-purpose parallel computers
and computational Grids [9]. For example, one existing implementation of the batch
pattern training algorithm [6] reaches an efficiency of 80 % on 10 processors
of the transputer TMBO08; however, the efficiency of this algorithm on general-purpose
high performance computers has not been studied yet.

The goal of this paper is to study the scalability of the parallel batch pattern neural
network training algorithm on a general-purpose parallel computer in order to form
recommendations for the further use of this algorithm on heterogeneous Grid systems. The
scalability of a parallel algorithm is understood as its ability to maintain the same parallelization
efficiency when both the dimension of the parallelized problem and the number of processors
of the parallel machine are progressively increased [11].


1. Architecture of Multilayer Perceptron and Batch Pattern 
Training Algorithm 


It is expedient to study the parallelization of the multilayer perceptron because this kind
of NN is simple and has very good generalization properties.
It is therefore often used for many practical tasks, including prediction, recognition,
optimization and control [1]. However, parallelizing a single multilayer perceptron with
the standard sequential back propagation training algorithm does not give good parallelization
efficiency [10] because of the high synchronization and communication overhead among the parallel
processors. It is therefore expedient to use the batch pattern training algorithm, which
updates the neurons' weights and thresholds at the end of each training epoch, i.e. after
all training patterns have been presented to the inputs and outputs of the perceptron in the
training mode.

The output value of the three-layer perceptron (Fig. 1) can be formulated as:

$y = F_3\left(\sum_{j=1}^{N} w_{j3}\, F_2\left(\sum_{i=1}^{M} w_{ij} x_i - T_j\right) - T\right)$, (1)

where $N$ is the number of neurons in the hidden layer; $w_{j3}$ is the weight of the synapse
from neuron $j$ of the hidden layer to the output neuron; $w_{ij}$ are the weights from the input
neurons to neuron $j$ in the hidden layer; $x_i$ are the input values ($i = 1, \dots, M$, with
$M$ the number of input neurons); $T_j$ are the thresholds of the
neurons of the hidden layer and $T$ is the threshold of the output neuron [1], [12]. The logistic
activation function $F(x) = \frac{1}{1 + e^{-x}}$ is used for the neurons of the hidden ($F_2$)
and output ($F_3$) layers.


Figure 1. The structure of three-layer perceptron
The back propagation batch pattern training algorithm consists of the following steps [12]:

1. Set the desired value of the total Sum-Squared Error (SSE) to $E_{min}$ and the number of
training iterations $t$.

2. Initialize the weights and the thresholds of the neurons with values in the range
(0, 0.5) [12].

3. For the training pattern $pt$:

3.1. Calculate the output value $y^{pt}(t)$ using expression (1).

3.2. Calculate the error of the output neuron $\gamma_3^{pt}(t) = y^{pt}(t) - d^{pt}(t)$, where
$y^{pt}(t)$ is the output value of the perceptron and $d^{pt}(t)$ is the target output value.

3.3. Calculate the error of the hidden layer neurons
$\gamma_j^{pt}(t) = \gamma_3^{pt}(t) \cdot w_{j3}(t) \cdot F_3'(S^{pt}(t))$,
where $S^{pt}(t)$ is the weighted sum of the output neuron.

3.4. Calculate the delta weights and delta thresholds of all the perceptron's neurons
and add the result to the value accumulated over the previous patterns:
$s\Delta w_{j3} = s\Delta w_{j3} + \gamma_3^{pt}(t) \cdot F_3'(S^{pt}(t)) \cdot h_j^{pt}(t)$,
$s\Delta T = s\Delta T + \gamma_3^{pt}(t) \cdot F_3'(S^{pt}(t))$,
$s\Delta w_{ij} = s\Delta w_{ij} + \gamma_j^{pt}(t) \cdot F_2'(S_j^{pt}(t)) \cdot x_i^{pt}(t)$,
$s\Delta T_j = s\Delta T_j + \gamma_j^{pt}(t) \cdot F_2'(S_j^{pt}(t))$,
where $S_j^{pt}(t)$ and $h_j^{pt}(t)$ are the weighted sum and the output value of hidden
neuron $j$ respectively.

3.5. Calculate the SSE using $E^{pt}(t) = \frac{1}{2}\left(y^{pt}(t) - d^{pt}(t)\right)^2$.

4. Repeat step 3 for each training pattern $pt$, where $pt \in \{1, \dots, PT\}$ and $PT$ is
the size of the training set.

5. Update the weights and the thresholds of all neurons using
$w_{ij}(PT) = w_{ij}(0) - \alpha(t) \cdot s\Delta w_{ij}$, $T_j(PT) = T_j(0) + \alpha(t) \cdot s\Delta T_j$,
where $\alpha(t)$ is the learning rate.

6. Calculate the total SSE $E(t)$ on the training iteration $t$ using
$E(t) = \sum_{pt=1}^{PT} E^{pt}(t)$.

7. If $E(t)$ is greater than the desired error $E_{min}$, then increase the number of training
iterations to $t+1$ and go to step 3; otherwise stop the training process.


2. Parallel Back Propagation Batch Pattern 
Training Algorithm 


The analysis of the batch training algorithm in Section 1 shows that the sequential
execution of points 3.1-3.5 for all training patterns in the training set
can be transformed into parallel execution, because the summation operations for
$s\Delta w_{ij}$ and $s\Delta T_j$ are independent of each other. To develop the parallel
algorithm, the whole computational job must be divided among the Master processor (which
performs assignment functions and calculations) and the Slave processors (which perform only
calculations).
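
The independence of the sums can be written out explicitly. As a sketch, let $P_1, \dots, P_p$ denote the disjoint subsets of pattern indices assigned to the $p$ processors (this notation is introduced here and does not appear in the paper). Because the weights stay fixed during an epoch, each per-pattern term accumulated in point 3.4 depends only on its own pattern, so the batch sum splits into partial sums:

```latex
\[
\begin{aligned}
s\Delta w_{ij}
  &= \sum_{pt=1}^{PT} \gamma_j^{pt}(t)\, F_2'\!\left(S_j^{pt}(t)\right) x_i^{pt}(t)
   = \sum_{k=1}^{p} \; \sum_{pt \in P_k} \gamma_j^{pt}(t)\, F_2'\!\left(S_j^{pt}(t)\right) x_i^{pt}(t), \\
E(t)
  &= \sum_{pt=1}^{PT} E^{pt}(t)
   = \sum_{k=1}^{p} \; \sum_{pt \in P_k} E^{pt}(t).
\end{aligned}
\]
```

Here $\gamma_j^{pt}(t)$ is the error of hidden neuron $j$, $S_j^{pt}(t)$ its weighted sum and $E^{pt}(t)$ the per-pattern SSE. The same decomposition applies to the sums for the hidden-to-output weights and for both kinds of thresholds, so one global summation per epoch recovers the sequential result, up to the order of floating-point additions.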

The algorithms of the Master and the Slave processors are depicted in
Fig. 2a and Fig. 2b respectively. The Master starts by defining (i) the number of patterns
PT in the training data set and (ii) the number of processors p used for the parallel execution
of the training algorithm. The Master divides all patterns into equal parts corresponding to the
number of Slaves and assigns one part of the patterns to itself. Then the Master sends to the
Slaves the numbers of the patterns each of them has to train.
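
The division of patterns can be sketched as follows. The paper does not say how a remainder is handled when PT is not divisible by the number of processors, and its flowchart labels the part size as PT/(p - 1) while the text also gives the Master a part. The sketch below therefore makes its own assumptions: all p processors, the Master included, receive a contiguous block, and the first PT mod p processors receive one extra pattern. The function name `pattern_range` is hypothetical.

```c
#include <stddef.h>

/*
 * Contiguous block of patterns [*first, *first + *count) for processor
 * `rank` (0 .. p-1) when `pt_total` patterns are shared by `p` processors.
 * The first (pt_total % p) processors receive one extra pattern.
 * This remainder policy is an assumption of the sketch.
 */
void pattern_range(size_t pt_total, size_t p, size_t rank,
                   size_t *first, size_t *count) {
    size_t base = pt_total / p;     /* patterns every processor gets */
    size_t extra = pt_total % p;    /* leftover patterns to spread out */
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

For the 794 training patterns and 8 processors used later in the experiments, this policy gives 100 patterns to each of the first two processors and 99 to each of the remaining six.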
Each Slave executes the following operations for its assigned patterns:
- calculates points 3.1-3.5 of the algorithm from Section 1, with point 4 executed
only over the assigned training patterns. This step yields the partial sums of the delta weights
$s\Delta w_{ij}$ and delta thresholds $s\Delta T_j$;
- calculates the partial SSE for the assigned training patterns.
After processing all assigned patterns, each Slave waits for the other Slaves and the Master
at the synchronization point. At the same time the Master processes its own (self-assigned)
training patterns and calculates its own partial values of the delta weights $s\Delta w_{ij}$
and delta thresholds $s\Delta T_j$.


[Figure 2: flowcharts of the Master and Slave processors.
a) Master: read the input data; define PT and p; send PT/(p - 1) patterns to the Slaves; calculate p.3 and p.4 for own training patterns; synchronize with the Slaves; reduce and sum $s\Delta w_{ij}$, $s\Delta T_j$, E(t) from all Slaves and send the result back to all Slaves; update $w_{ij}$, $T_j$ according to p.5.
b) Slave: read the input data; receive PT/(p - 1) patterns from the Master; calculate p.3 and p.4 for assigned training patterns; synchronize with the other Slaves and the Master; reduce and sum $s\Delta w_{ij}$, $s\Delta T_j$, E(t) from all Slaves and the Master; update $w_{ij}$, $T_j$ according to p.5.]

Figure 2. The algorithms of Master (a) and Slave (b) processors


The global reduce operation with summation is executed right after the synchronization
point. The summed values of $s\Delta w_{ij}$ and $s\Delta T_j$ are then sent to all processors
working in parallel. Using a global reduce operation that simultaneously returns the reduced
values to the senders decreases the time overhead at the synchronization point. The summed
values of $s\Delta w_{ij}$ and $s\Delta T_j$ are then placed into the local memory of each
processor. Each Slave and the Master use these values to update the
weights and thresholds according to point 5 of the algorithm. The updated weights
and thresholds are used in the next iteration of the training algorithm. Since the
summed value of E(t) is also obtained as a result of the reduction, the Master executes the
operation from point 7 of the algorithm, i.e. decides whether to continue the training or not.

The software routine is developed in the C programming language using the standard
MPI library. The parallel part of the algorithm starts with a call to the MPI_Init() function.
The parallel processors use MPI_Barrier() as the synchronization point. The reduction of the
deltas of weights $s\Delta w_{ij}$ and thresholds $s\Delta T_j$ is performed by the function
MPI_Allreduce(), which avoids the additional step of sending the updated weights and thresholds
from the Master back to each Slave. The function MPI_Finalize() finishes the parallel part of
the algorithm.
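
The reduction step might look like the following sketch. It is not the author's code: the helper names `global_sum` and `reduce_epoch`, the buffer layout, and the use of `MPI_IN_PLACE` are assumptions made here. The MPI call is guarded by a `USE_MPI` macro (also an assumption of this sketch) so that the same file compiles as a single-process program, in which case the global sum is simply the local value.

```c
#ifdef USE_MPI
#include <mpi.h>
#endif

/*
 * Sum n doubles element-wise across all processes and leave the result
 * in buf on every process. Built without USE_MPI there is only one
 * process, so buf already holds the global sum. Returns 0 on success.
 */
int global_sum(double *buf, int n) {
#ifdef USE_MPI
    /* MPI_IN_PLACE: buf is both the contribution and the result. */
    int rc = MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE, MPI_SUM,
                           MPI_COMM_WORLD);
    return rc == MPI_SUCCESS ? 0 : 1;
#else
    (void)buf;
    (void)n;
    return 0;
#endif
}

/*
 * Pack the accumulated deltas and the partial SSE into one buffer so
 * that a single reduction covers both. scratch must hold n_deltas + 1
 * doubles. Layout [deltas..., sse] is an assumption of this sketch.
 */
int reduce_epoch(double *deltas, int n_deltas, double *sse, double *scratch) {
    for (int k = 0; k < n_deltas; ++k)
        scratch[k] = deltas[k];
    scratch[n_deltas] = *sse;

    if (global_sum(scratch, n_deltas + 1) != 0)
        return 1;

    for (int k = 0; k < n_deltas; ++k)
        deltas[k] = scratch[k];
    *sse = scratch[n_deltas];
    return 0;
}
```

For a perceptron with 5 input, 10 hidden and 1 output neurons, the deltas comprise 50 input-to-hidden weights, 10 hidden thresholds, 10 hidden-to-output weights and 1 output threshold, so one reduction would carry 71 deltas plus the SSE. After the call, every process holds the same summed values and can apply point 5 locally.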


3. Experimental Research 


The parallel computer NEC TX-7, located in the Center of Excellence of High Performance
Computing, University of Calabria, Italy (www.hpcc.unical.it), is used for the experimental
study of the developed parallel algorithm. The NEC TX-7 consists of 4 identical units. Each unit
has 4 Gb RAM and four 64-bit Intel Itanium2 processors with a clock rate of 1 GHz. This
16-processor computer with 64 Gb of total RAM has a peak performance of 64 GFLOPS. The
NEC TX-7 runs under the Linux operating system.

It is expedient to form research scenarios that increase the dimension of the parallelized
problem and to study the parallelization efficiency under these scenarios. The quality
of perceptron training is described by the achieved value of the sum-squared error (SSE).
Therefore the number of training epochs can be used as the input parameter that defines the
research scenarios, each of which yields a different SSE. A prediction task and a predicting
multilayer perceptron with 5 input, 10 hidden and 1 output neurons are used for the study.
The neurons of the hidden and output layers have the logistic activation function. The training
data set contains 794 training patterns and the prediction data set contains 482 patterns.
The number of training epochs is varied from 10000 to $10^6$ during the study. The learning
rates of the perceptron's hidden and output layers are constant and equal to
$\alpha(t) = 0.01$. The parameters of the scenarios are presented in Table 1.


Table 1. Parameters of research scenarios


Scenario   | Number of  | Reached | Time of sequential | Time of parallel execution | Relative error
           | iterations | SSE     | execution, seconds | on 1 processor, seconds    | of prediction, %
Scenario 1 | 10000      | 2.9850  | 13.71              | 13.06                      | 12.3
Scenario 2 | 100000     | 0.4391  | 137.08             | 130.79                     | 4.7
Scenario 3 | 500000     | 0.2228  | 685.49             | 653                        | 1.0
Scenario 4 | 1000000    | 0.1626  | 1371.00            | 1307.94                    | 0.1


As Table 1 shows, the perceptron trains well: the SSE decreases from 2.98 to 0.16 and
the relative prediction error decreases from 12.3 % to 0.1 %. The difference between the
execution time of the sequential routine and that of the parallel routine on 1 processor of
the NEC TX-7 is within 5 %. The execution time grows linearly with the number of training
epochs, because each training epoch takes a fixed execution time.
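
The linear growth can be checked directly from the sequential times in Table 1 by dividing each time by its number of epochs. The per-epoch figure below is derived arithmetic, not a value reported in the paper:

```latex
\[
\frac{13.71\ \text{s}}{10^{4}} \approx 1.371\ \text{ms}, \quad
\frac{137.08\ \text{s}}{10^{5}} \approx 1.371\ \text{ms}, \quad
\frac{685.49\ \text{s}}{5 \cdot 10^{5}} \approx 1.371\ \text{ms}, \quad
\frac{1371.00\ \text{s}}{10^{6}} = 1.371\ \text{ms}
\]
```

All four scenarios give the same cost of about 1.37 ms per training epoch for the 5-10-1 perceptron with 794 patterns.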

The execution time of the parallel batch training algorithm on 2, 4 and 8 processors of the
NEC TX-7 is presented in Table 2. The speedup S = Ts/Tp and the efficiency E = S/p × 100 %
of parallelization are studied on 2, 4 and 8 processors, where Ts is the execution time of the
sequential routine and Tp is the execution time of the same routine on p processors
of the parallel computer.


Table 2. Execution time, speedup and efficiency of parallelization


Execution time (seconds) on processors

Scenario   | 2      | 4      | 8
Scenario 1 | 6.85   | 3.90   | 2.50
Scenario 2 | 68.52  | 39.06  | 25.01
Scenario 3 | 342.23 | 195.31 | 129.04
Scenario 4 | 685.30 | 390.12 | 250.35

Speedup on processors

Scenario   | 2      | 4      | 8
Scenario 1 | 1.9066 | 3.3487 | 5.224
Scenario 2 | 1.9088 | 3.3484 | 5.2295
Scenario 3 | 1.9080 | 3.3434 | 5.0604
Scenario 4 | 1.9086 | 3.3527 | 5.2244

Efficiency on processors, %

Scenario   | 2  | 4  | 8
Scenario 1 | 95 | 84 | 65
Scenario 2 | 95 | 84 | 65
Scenario 3 | 95 | 84 | 63
Scenario 4 | 95 | 84 | 65


As the results show, the parallel batch back propagation training algorithm
of the multilayer perceptron provides very good scalability, i.e. it keeps the same level of
parallelization efficiency as the dimension of the parallelized problem increases. The
parallelization efficiencies of this algorithm are 95 %, 84 % and 63 % on 2, 4 and 8 processors
of the general-purpose parallel computer NEC TX-7 respectively for the multilayer perceptron
5-10-1 with 794 training patterns.
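
The speedup and efficiency formulas can be applied to the measured times with two small C helpers. The function names are hypothetical. One observation from the data, offered as an inference rather than a statement of the paper: the tabulated speedups are reproduced when the baseline is the time of the parallel routine on 1 processor (13.06 s for Scenario 1), not the purely sequential time (13.71 s).

```c
/* Speedup S = T_base / T_p, where T_p is the time on p processors. */
double speedup(double t_base, double t_parallel) {
    return t_base / t_parallel;
}

/* Parallelization efficiency E = S / p * 100 %. */
double efficiency_percent(double t_base, double t_parallel, int p) {
    return speedup(t_base, t_parallel) / (double)p * 100.0;
}
```

For Scenario 1, `speedup(13.06, 6.85)` is about 1.9066 and `efficiency_percent(13.06, 2.50, 8)` is about 65.3, matching the tabulated speedup on 2 processors and the efficiency on 8 processors.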


Conclusions 


A parallel batch pattern back propagation training algorithm for the multilayer
perceptron is developed in this paper. The study of parallelization efficiency for the scenarios
of increasing the number of training epochs from 10000 to $10^6$ showed very good scalability of
the parallel algorithm. This means that the parallelization efficiency of the algorithm does not
depend on the number of training epochs. The parallelization efficiencies of the parallel batch
pattern back propagation training algorithm of the multilayer perceptron are 95 %, 84 % and 63 %
on 2, 4 and 8 processors of the general-purpose computer NEC TX-7 respectively for the multilayer
perceptron 5-10-1 with 794 training patterns. This level of parallelization efficiency
is sufficient for using the parallel algorithm in a Grid environment on general-purpose parallel
and high performance computers. For future research it is expedient to estimate the
parallelization efficiency of the developed parallel algorithm for scenarios that change the
architecture of the multilayer perceptron (the number of neurons) as well as the number of
training patterns in the input data set.
V. Turchenko

Scalability of Parallel Batch Pattern Neural Network Training Algorithm

The development of a parallel batch pattern back propagation training algorithm of a multilayer
perceptron and the study of its scalability on a general-purpose parallel computer are
presented in this paper. The model of the multilayer perceptron and the batch pattern algorithm
of its training are described formally. The parallel batch pattern training algorithm is presented
in algorithmic form. The scalability of the developed parallel algorithm is studied for a
proportionally increasing size of the parallelization problem on the general-purpose parallel
computer NEC TX-7.